Again, one might think that a compiler should be able to automatically transform the combine3 code shown in Figure 5.9 to accumulate the value in a register, as it does with the code for combine4 shown in Figure 5.10.

In fact, however, the two functions can have different behavior due to memory aliasing. Consider, for example, the case of integer data with multiplication as the operation and 1 as the identity element. Let v be a vector consisting of the three elements [2, 3, 5] and consider the following two function calls:

```
combine3(v, get_vec_start(v) + 2);
combine4(v, get vec start(v) + 2);
```

That is, we create an alias between the last element of the vector and the destination for storing the result. The two functions would then execute as follows:

| Function | Initial   | Before Loop | i = 0     | i = 1     | i = 2      | Final      |
|----------|-----------|-------------|-----------|-----------|------------|------------|
| combine3 | [2, 3, 5] | [2, 3, 1]   | [2, 3, 2] | [2, 3, 6] | [2, 3, 36] | [2, 3, 36] |
| combine4 | [2, 3, 5] | [2, 3, 5]   | [2, 3, 5] | [2, 3, 5] | [2, 3, 5]  | [2, 3, 30] |

As shown above, combine3 accumulates its result at the destination, which in this case is the final vector element. This value is therefore set first to 1, then to  $2 \cdot 1 = 2$ , and then to  $3 \cdot 2 = 6$ . On the final iteration this value is then multiplied by itself to yield a final value of 36. For the case of combine4, the vector remains unchanged until the end, when the final element is set to the computed result  $1 \cdot 2 \cdot 3 \cdot 5 = 30$ .

Of course, our example showing the distinction between combine3 and combine4 is highly contrived. One could argue that the behavior of combine4 more closely matches the intention of the function description. Unfortunately, an optimizing compiler cannot make a judgement about the conditions under which a function might be used and what the programmer's intentions might be. Instead, when given combine3 to compile, it is obligated to preserve its exact functionality, even if this means generating inefficient code.

# 5.7 Understanding Modern Processors

Up to this point, we have applied optimizations that did not rely on any features of the target machine. They simply reduced the overhead of procedure calls and eliminated some of the critical "optimization blockers" that cause difficulties for optimizing compilers. As we seek to push the performance further, we must begin to consider optimizations that make more use of the means by which processors execute instructions and the capabilities of particular processors. Getting every last bit of performance requires a detailed analysis of the program as well as code generation tuned for the target processor. Nonetheless, we can apply some basic optimizations that will yield an overall performance improvement on a large class of processors. The detailed performance results we report here may not hold for other machines, but the general principles of operation and optimization apply to a wide variety of machines.

To understand ways to improve performance, we require a simple operational model of how modern processors work. Due to the large number of transistors that can be integrated onto a single chip, modern microprocessors employ complex hardware that attempts to maximize program performance. One result is that their actual operation is far different from the view that is perceived by looking at assembly-language programs. At the assembly-code level, it appears as if instructions are executed one at a time, where each



Figure 5.11: **Block Diagram of a Modern Processor.** The Instruction Control Unit is responsible for reading instructions from memory and generating a sequence of primitive operations. The Execution Unit then performs the operations and indicates whether the branches were correctly predicted.

instruction involves fetching values from registers or memory, performing an operation, and storing results back to a register or memory location. In the actual processor, a number of instructions are evaluated simultaneously. In some designs, there can be 80 or more instructions "in flight." Elaborate mechanisms are employed to make sure the behavior of this parallel execution exactly captures the sequential semantic model required by the machine-level program.

## 5.7.1 Overall Operation

Figure 5.11 shows a very simplified view of a modern microprocessor. Our hypothetical processor design is based loosely on the Intel "P6" microarchitecture [28], the basis for the Intel PentiumPro, Pentium II and Pentium III processors. The newer Pentium 4 has a different microarchitecture, but it has a similar overall structure to the one we present here. The P6 microarchitecture typifies the high-end processors produced by a number of manufacturers since the late 1990s. It is described in the industry as being *superscalar*, which means it can perform multiple operations on every clock cycle, and *out-of-order* meaning that the

order in which instructions execute need not correspond to their ordering in the assembly program. The overall design has two main parts. The *Instruction Control Unit* (ICU) is responsible for reading a sequence of instructions from memory and generating from these a set of primitive operations to perform on program data. The *Execution Unit* (EU) then executes these operations.

The ICU reads the instructions from an *instruction cache*—a special, high-speed memory containing the most recently accessed instructions. In general, the ICU fetches well ahead of the currently executing instructions, so that it has enough time to decode these and send operations down to the EU. One problem, however, is that when a program hits a branch, there are two possible directions the program might go. The branch can be *taken*, with control passing to the branch target. Alternatively, the branch can be *not taken*, with control passing to the next instruction in the instruction sequence. Modern processors employ a technique known as *branch prediction*, where they guess whether or not a branch will be taken, and they also predict the target address for the branch. Using a technique known as *speculative execution*, the processor begins fetching and decoding instructions at where it predicts the branch will go, and even begins executing these operations before it has been determined whether or not the branch prediction was correct. If it later determines that the branch was predicted incorrectly, it resets the state to that at the branch point and begins fetching and executing instructions in the other direction. A more exotic technique would be to begin fetching and executing instructions for both possible directions, later discarding the results for the incorrect direction. To date, this approach has not been considered cost effective. The block labeled *Fetch Control* incorporates branch prediction to perform the task of determining which instructions to fetch.

The *Instruction Decoding* logic takes the actual program instructions and converts them into a set of primitive *operations*. Each of these operations performs some simple computational task such as adding two numbers, reading data from memory, or writing data to memory. For machines with complex instructions, such as an IA32 processor, an instruction can be decoded into a variable number of operations. The details vary from one processor design to another, but we attempt to describe a typical implementation. In this machine, decoding the instruction

```
addl %eax, %edx
```

yields a single addition operation, whereas decoding the instruction

```
addl %eax,4(%edx)
```

yields three operations: one to *load* a value from memory into the processor, one to add the loaded value to the value in register %eax, and one to *store* the result back to memory. This decoding splits instructions to allow a division of labor among a set of dedicated hardware units. These units can then execute the different parts of multiple instructions in parallel. For machines with simple instructions, the operations correspond more closely to the original instructions.

The EU receives operations from the instruction fetch unit. Typically, it can receive a number of them on each clock cycle. These operations are dispatched to a set of *functional units* that perform the actual operations. These functional units are specialized to handle specific types of operations. Our figure illustrates a typical set of functional units. It is styled after those found in recent Intel processors. The units in the figure are as follows:

<sup>&</sup>lt;sup>1</sup>We use the term "branch" specifically to refer to conditional jump instructions. Other instructions that can transfer control to multiple destinations, such as procedure return and indirect jumps, provide similar challenges for the processor.

**Integer/Branch:** Performs simple integer operations (add, test, compare, logical). Also processes branches, as is discussed below.

General Integer: Can handle all integer operations, including multiplication and division.

Floating-Point Add: Handles simple floating-point operations (addition, format conversion).

**Floating-Point Multiplication/Division:** Handles floating-point multiplication and division. More complex floating-point instructions, such transcendental functions, are converted into sequences of operations.

**Load:** Handles operations that read data from the memory into the processor. The functional unit has an adder to perform address computations.

**Store:** Handles operations that write data from the processor to the memory. The functional unit has an adder to perform address computations.

As shown in the figure, the load and store units access memory via a *data cache*, a high-speed memory containing the most recently accessed data values.

With speculative execution, the operations are evaluated, but the final results are not stored in the program registers or data memory until the processor can be certain that these instructions should actually have been executed. Branch operations are sent to the EU not to determine where the branch should go, but rather to determine whether or not they were predicted correctly. If the prediction was incorrect, the EU will discard the results that have been computed beyond the branch point. It will also signal to the Branch Unit that the prediction was incorrect and indicate the correct branch destination. In this case the Branch Unit begins fetching at the new location. Such a *misprediction* incurs a significant cost in performance. It takes a while before the new instructions can be fetched, decoded, and sent to the execution units. We explore this further in Section 5.12.

Within the ICU, the *Retirement Unit* keeps track of the ongoing processing and makes sure that it obeys the sequential semantics of the machine-level program. Our figure shows a *Register File*, containing the integer and floating-point registers, as part of the Retirement Unit, because this unit controls the updating of these registers. As an instruction is decoded, information about it is placed in a first-in, first-out queue. This information remains in the queue until one of two outcomes occurs. First, once the operations for the instruction have completed and any branch points leading to this instruction are confirmed as having been correctly predicted, the instruction can be *retired*, with any updates to the program registers being made. If some branch point leading to this instruction was mispredicted, on the other hand, the instruction will be *flushed*, discarding any results that may have been computed. By this means, mispredictions will not alter the program state.

As we have described, any updates to the program registers occur only as instructions are being retired, and this takes place only after the processor can be certain that any branches leading to this instruction have been correctly predicted. To expedite the communication of results from one instruction to another, much of this information is exchanged among the execution units, shown in the figure as "Operation Results." As the arrows in the figure show, the execution units can send results directly to each other.

The most common mechanism for controlling the communication of operands among the execution units is called *register renaming*. When an instruction that updates register r is decoded, a tag t is generated

| Operation               | Latency | Issue Time |
|-------------------------|---------|------------|
| Integer Add             | 1       | 1          |
| Integer Multiply        | 4       | 1          |
| Integer Divide          | 36      | 36         |
| Floating-Point Add      | 3       | 1          |
| Floating-Point Multiply | 5       | 2          |
| Floating-Point Divide   | 38      | 38         |
| Load (Cache Hit)        | 3       | 1          |
| Store (Cache Hit)       | 3       | 1          |

Figure 5.12: **Performance of Pentium III Arithmetic Operations.** Latency represents the total number of cycles for a single operation. Issue time denotes the number of cycles between successive, independent operations. (Obtained from Intel literature).

giving a unique identifier to the result of the operation. An entry (r,t) is added to a table maintaining the association between each program register and the tag for an operation that will update this register. When a subsequent instruction using register r as an operand is decoded, the operation sent to the Execution Unit will contain t as the source for the operand value. When some execution unit completes the first operation, it generates a result (v,t) indicating that the operation with tag t produced value v. Any operation waiting for t as a source will then use v as the source value. By this mechanism, values can be passed directly from one operation to another, rather than being written to and read from the register file. The renaming table only contains entries for registers having pending write operations. When a decoded instruction requires a register r, and there is no tag associated with this register, the operand is retrieved directly from the register file. With register renaming, an entire sequence of operations can be performed speculatively, even though the registers are updated only after the processor is certain of the branch outcomes.

#### 5.7.2 Functional Unit Performance

Figure 5.12 documents the performance of some of basic operations for an Intel Pentium III. These timings are typical for other processors as well. Each operation is characterized by two cycle counts: the *latency*, indicating the total number of cycles the functional unit requires to complete the operation; and the *issue time*, indicating the number of cycles between successive, independent operations. The latencies range from one cycle for basic integer operations; several cycles for loads, stores, integer multiplication, and the more common floating-point operations; and then to many cycles for division and other complex operations.

As the third column in Figure 5.12 shows, several functional units of the processor are *pipelined*, meaning that they can start on a new operation before the previous one is completed. The issue time indicates the number of cycles between successive operations for the unit. In a pipelined unit, the issue time is smaller than the latency. A pipelined function unit is implemented as a series of stages, each of which performs part of the operation. For example, a typical floating-point adder contains three stages: one to process the exponent values, one to add the fractions, and one to round the final result. The operations can proceed through the stages in close succession rather than waiting for one operation to complete before the next begins. This capability can only be exploited if there are successive, logically independent operations to

be performed. As indicated, most of the units can begin a new operation on every clock cycle. The only exceptions are the floating-point multiplier, which requires a minimum of two cycles between successive operations, and the two dividers, which are not pipelined at all.

Circuit designers can create functional units with a range of performance characteristics. Creating a unit with short latency or issue time requires more hardware, especially for more complex functions such as multiplication and floating-point operations. Since there is only a limited amount of space for these units on the microprocessor chip, the CPU designers must carefully balance the number of functional units and their individual performance to achieve optimal overall performance. They evaluate many different benchmark programs and dedicate the most resources to the most critical operations. As Figure 5.12 indicates, integer multiplication and floating-point multiplication and addition were considered important operations in design of the Pentium III, even though a significant amount of hardware is required to achieve the low latencies and high degree of pipelining shown. On the other hand, division is relatively infrequent, and difficult to implement with short latency or issue time, and so these operations are relatively slow.

## 5.7.3 A Closer Look at Processor Operation

As a tool for analyzing the performance of a machine level program executing on a modern processor, we have developed a more detailed textual notation to describe the operations generated by the instruction decoder, as well as a graphical notation to show the processing of operations by the functional units. Neither of these notations exactly represents the implementation of a specific, real-life processor. They are simply methods to help understand how a processor can take advantage of parallelism and branch prediction in executing a program.

#### **Translating Instructions into Operations**

We present our notation by working with combine4 (Figure 5.10), our fastest code up to this point as an example. We focus just on the computation performed by the loop, since this is the dominating factor in performance for large vectors. We consider the cases of integer data with both multiplication and addition as the combining operations. The compiled code for this loop with multiplication consists of four instructions. In this code, register \*eax holds the pointer data, \*edx holds i, \*ecx holds x, and \*esi holds length.

Every time the processor executes the loop, the instruction decoder translates these four instructions into a sequence of operations for the Execution Unit. On the first iteration, with i equal to 0, our hypothetical machine would issue the following sequence of operations:

| Assembly Instructions    | Execution Unit Operations |               |        |
|--------------------------|---------------------------|---------------|--------|
| .L24:                    |                           |               |        |
| imull (%eax,%edx,4),%ecx | load (%eax, %edx.0, 4)    | $\rightarrow$ | t.1    |
|                          | imull t.1, %ecx.0         | $\rightarrow$ | %ecx.1 |
| incl %edx                | incl %edx.0               | $\rightarrow$ | %edx.1 |
| cmpl %esi,%edx           | cmpl %esi, %edx.1         | $\rightarrow$ | cc.1   |
| jl .L24                  | jl-taken cc.1             |               |        |

In our translation, we have converted the memory reference by the multiply instruction into an explicit load instruction that reads the data from memory into the processor. We have also assigned *operand labels* to the values that change each iteration. These labels are a stylized version of the tags generated by register renaming. Thus, the value in register <code>%ecx</code> is identified by the label <code>%ecx.0</code> at the beginning of the loop, and by <code>%ecx.1</code> after it has been updated. The register values that do not change from one iteration to the next would be obtained directly from the register file during decoding. We also introduce the label <code>t.1</code> to denote the value read by the load operation and passed to the <code>imull</code> operation, and we explicitly show the destination of the operation. Thus, the pair of operations

```
load (%eax, %edx.0, 4) \rightarrow t.1 imull t.1, %ecx.0 \rightarrow %ecx.1
```

indicates that the processor first performs a load operation, computing the address using the value of <code>%eax</code> (which does not change during the loop), and the value stored in <code>%edx</code> at the start of the loop. This will yield a temporary value, which we label <code>t.1</code>. The multiply operation then takes this value and the value of <code>%ecx</code> at the start of the loop and produces a new value for <code>%ecx</code>. As this example illustrates, tags can be associated with intermediate values that are never written to the register file.

The operation

```
incl %edx.0 \rightarrow %edx.1
```

indicates that the increment operation adds one to the value of %edx at the start of the loop to generate a new value for this register.

The operation

```
cmpl %esi, %edx.1 \rightarrow cc.1
```

indicates that the compare operation (performed by either integer unit) compares the value in <code>%esi</code> (which does not change in the loop) with the newly computed value for <code>%edx</code>. It then sets the condition codes, identified with the explicit label <code>cc.1</code>. As this example illustrates, the processor can use renaming to track changes to the condition code registers.

Finally, the jump instruction was predicted taken. The jump operation

```
jl-taken cc.1
```





Figure 5.13: **Operations for First Iteration of Inner Loop of combine4 for integer multiplication.** Memory reads are explicitly converted to loads. Register names are tagged with instance numbers.

checks whether the newly computed values for the condition codes (cc.1) indicate this was the correct choice. If not, then it signals the ICU to begin fetching instructions at the instruction following the jl. To simplify the notation, we omit any information about the possible jump destinations. In practice, the processor must keep track of the destination for the unpredicted direction, so that it can begin fetching from there in the event the prediction is incorrect.

As this example translation shows, our operations mimic the structure of the assembly-language instructions in many ways, except that they refer to their source and destination operations by labels that identify different instances of the registers. In the actual hardware, register renaming dynamically assigns tags to indicate these different values. Tags are bit patterns rather than symbolic names such as "%edx.1," but they serve the same purpose.

#### **Processing of Operations by the Execution Unit**

Figure 5.13 shows the operations in two forms: that generated by the instruction decoder and as a *computation graph* where operations are represented by rounded boxes and arrows indicate the passing of data between operations. We only show the arrows for the operands that change from one iteration to the next, since only these values are passed directly between functional units.

The height of each operator box indicates how many cycles the operation requires, that is, the latency of that particular function. In this case, integer multiplication imull requires four cycles, load requires three, and the other operations require one. In demonstrating the timing of a loop, we position the blocks vertically to represent the times when operations are performed, with time increasing in the downward direction. We can see that the five operations for the loop form two parallel chains, indicating two series of computations that must be performed in sequence. The chain on the left processes the data, first reading an array element from memory and then multiplying it times the accumulated product. The chain on the right processes the loop index i, first incrementing it and then comparing it to length. The jump operation checks the result of this comparison to make sure the branch was correctly predicted. Note that there are no outgoing arrows





Figure 5.14: **Operations for First Iteration of Inner Loop of combine4 for Integer Addition.** Compared to multiplication, the only change is that the addition operation requires only one cycle.

from the jump operation box. If the branch was correctly predicted, no other processing is required. If the branch was incorrectly predicted, then the branch function unit will signal the instruction fetch control unit, and this unit will take corrective action. In either case, the other operations do not depend on the outcome of the jump operation.

Figure 5.14 shows the same translation into operations but with integer addition as the combining operation. As the graphical depiction shows, all of the operations, except load, now require just one cycle.

#### **Scheduling of Operations with Unlimited Resources**

To see how a processor would execute a series of iterations, imagine first a processor with an unlimited number of functional units and with perfect branch prediction. Each operation could then begin as soon as its data operands were available. The performance of such a processor would be limited only by the latencies and throughputs of the functional units, and the data dependencies in the program. Figure 5.15 shows the computation graph for the first three iterations of the loop in combine4 with integer multiplication on such a machine. For each iteration, there is a set of five operations with the same configuration as those in Figure 5.13, with appropriate changes to the operand labels. The arrows from the operators of one iteration to those of another show the data dependencies between the different iterations.

Each operator is placed vertically at the highest position possible, subject to the constraint that no arrows can point upward, since this would indicate information flowing backward in time. Thus, the load operation of one iteration can begin as soon as the incl operation of the previous iteration has generated an updated value of the loop index.

The computation graph shows the parallel execution of operations by the Execution Unit. On each cycle, all of the operations on one horizontal line of the graph execute in parallel. The graph also demonstrates out-of-order, speculative execution. For example, the incl operation in one iteration is executed before the jl instruction of the previous iteration has even begun. We can also see the effect of pipelining. Each iteration requires at least seven cycles from start to end, but successive iterations are completed every 4 cycles. Thus, the effective processing rate is one iteration every 4 cycles, giving a CPE of 4.0.

The four-cycle latency of integer multiplication constrains the performance of the processor for this program. Each imull operation must wait until the previous one has completed, since it needs the result of



Figure 5.15: Scheduling of Operations for Integer Multiplication with Unlimited Number of Execution Units. The 4 cycle latency of the multiplier is the performance-limiting resource.



Figure 5.16: Scheduling of Operations for Integer Addition with Unbounded Resource Constraints. With unbounded resources the processor could achieve a CPE of 1.0.

this multiplication before it can begin. In our figure, the multiplication operations begin on cycles 4, 8, and 12. With each succeeding iteration, a new multiplication begins every fourth cycle.

Figure 5.16 shows the first four iterations of combine4 for integer addition on a machine with an unbounded number of functional units. With a single-cycle combining operation, the program could achieve a CPE of 1.0. We see that as the iterations progress, the Execution Unit would perform parts of seven operations on each clock cycle. For example, in cycle 4 we can see that the machine is executing the addl for iteration 1; different parts of the load operations for iterations 2, 3, and 4; the jl for iteration 2; the cmpl for iteration 3; and the incl for iteration 4.

#### **Scheduling of Operations with Resource Constraints**

Of course, a real processor has only a fixed set of functional units. Unlike our earlier examples, where the performance was constrained only by the data dependencies and the latencies of the functional units, performance becomes limited by resource constraints as well. In particular, our processor has only two units capable of performing integer and branch operations. In contrast, the graph of Figure 5.15 has three of these operations in parallel on cycles 3 and four in parallel on cycle 4.

Figure 5.17 shows the scheduling of the operations for combine4 with integer multiplication on a resource-constrained processor. We assume that the general integer unit and the branch/integer unit can each begin a new operation on every clock cycle. It is possible to have more than two integer or branch operations executing in parallel, as shown in cycle 6, because the imull operation is in its third cycle by this point.

With constrained resources, our processor must have some *scheduling policy* that determines which operation to perform when it has more than one choice. For example, in cycle 3 of the graph of Figure 5.15, we show three integer operations being executed: the jl of iteration 1, the cmpl of iteration 2, and the incl of iteration 3. For Figure 5.17, we must delay one of these operations. We do so by keeping track of



Figure 5.17: Scheduling of Operations for Integer Multiplication with Actual Resource Constraints. The multiplier latency remains the performance-limiting factor.



Figure 5.18: Scheduling of Operations for Integer Addition with Actual Resource Constraints. The limitation to two integer units constrains performance to a CPE of 2.0.

the *program order* for the operations, that is, the order in which the operations would be performed if we executed the machine-level program in strict sequence. We then give priority to the operations according to their program order. In this example, we would defer the incl operation, since any operation of iteration 3 is later in program order than those of iterations 1 and 2. Similarly, in cycle 4, we would give priority to the imull operation of iteration 1 and the jl of iteration 2 over that of the incl operation of iteration 3.

For this example, the limited number of functional units does not slow down our program. Performance is still constrained by the four-cycle latency of integer multiplication.

For the case of integer addition, the resource constraints impose a clear limitation on program performance. Each iteration requires four integer or branch operations, and there are only two functional units for these operations. Thus, we cannot hope to sustain a processing rate any better than two cycles per iteration. In creating the graph for multiple iterations of combine4 for integer addition, an interesting pattern emerges. Figure 5.18 shows the scheduling of operations for iterations 4 through 8. We chose this range of iterations because it shows a regular pattern of operation timings. Observe how the timing of all operations in iterations 4 and 8 is identical, except that the operations in iteration 8 occur eight cycles later. As the iterations proceed, the patterns shown for iterations 4 to 7 would keep repeating. Thus, we complete four iterations every eight

cycles, achieving the optimum CPE of 2.0.

## Summary of combine4 Performance

We can now consider the measured performance of combine4 for all four combinations of data type and combining operations:

| Function | Page | Method                  | Integer |      | Floatir | ng Point |
|----------|------|-------------------------|---------|------|---------|----------|
|          |      |                         | +       | *    | +       | *        |
| combine4 | 219  | Accumulate in temporary | 2.00    | 4.00 | 3.00    | 5.00     |

With the exception of integer addition, these cycle times nearly match the latency for the combining operation, as shown in Figure 5.12. Our transformations to this point have reduced the CPE value to the point where the time for the combining operation becomes the limiting factor.

For the case of integer addition, we have seen that the limited number of functional units for branch and integer operations limits the achievable performance. With four such operations per iteration, and just two functional units, we cannot expect the program to go faster than 2 cycles per iteration.

In general, processor performance is limited by three types of constraints. First, the data dependencies in the program force some operations to delay until their operands have been computed. Since the functional units have latencies of one or more cycles, this places a lower bound on the number of cycles in which a given sequence of operations can be performed. Second, the resource constraints limit how many operations can be performed at any given time. We have seen that the limited number of functional units is one such resource constraint. Other constraints include the degree of pipelining by the functional units, as well as limitations of other resources in the ICU and the EU. For example, an Intel Pentium III can only decode three instructions on every clock cycle. Finally, the success of the branch prediction logic constrains the degree to which the processor can work far enough ahead in the instruction stream to keep the execution unit busy. Whenever a misprediction occurs, a significant delay occurs getting the processor restarted at the correct location.

# 5.8 Reducing Loop Overhead

The performance of combine4 for integer addition is limited by the fact that each iteration contains four instructions, with only two functional units capable of performing them. Only one of these four instructions operates on the program data. The others are part of the loop overhead of computing the loop index and testing the loop condition.

We can reduce overhead effects by performing more data operations in each iteration, via a technique known as *loop unrolling*. The idea is to access and combine multiple array elements within a single iteration. The resulting program requires fewer iterations, leading to reduced loop overhead.

Figure 5.19 shows a version of our combining code using three-way loop unrolling. The first loop steps through the array three elements at a time. That is, the loop index i is incremented by three on each iteration, and the combining operation is applied to array elements i, i + 1, and i + 2 in a single iteration.

\_\_\_\_\_code/opt/combine.c

\_\_\_\_\_code/opt/combine.c

```
1 /* Unroll loop by 3 */
2 void combine5(vec ptr v, data t *dest)
3 {
       int length = vec length(v);
       int limit = length-2;
       data_t *data = get_vec_start(v);
       data t x = IDENT;
       int i;
8
10
       /* Combine 3 elements at a time */
       for (i = 0; i < limit; i+=3) {
           x = x OPER data[i] OPER data[i+1] OPER data[i+2];
12
13
14
       /* Finish any remaining elements */
15
       for (; i < length; i++) {</pre>
16
17
          x = x OPER data[i];
18
19
       *dest = x;
20 }
```

Figure 5.19: Unrolling Loop by 3. Loop unrolling can reduce the effect of loop overhead.





Figure 5.20: Operations for First Iteration of Inner Loop of Three-Way Unrolled Integer Addition. With this degree of loop unrolling we can combine three array elements using six integer/branch operations.

In general, the vector length will not be a multiple of 3. We want our code to work correctly for arbitrary vector lengths. We account for this requirement in two ways. First, we make sure the first loop does not overrun the array bounds. For a vector of length n, we set the loop limit to be n-2. We are then assured that the loop will only be executed when the loop index i satisfies i < n-2, and hence the maximum array index i+2 will satisfy i+2 < (n-2)+2=n. In general, if the loop is unrolled by k, we set the upper limit to be n-k+1. The maximum loop index i+k-1 will then be less than n. In addition to this, we add a second loop to step through the final few elements of the vector one at a time. The body of this loop will be executed between 0 and 2 times.

To better understand the performance of code with loop unrolling, let us look at the assembly code for the inner loop and its translation into operations.

| Assembly Instructions    | Execution Unit Operations |               |         |
|--------------------------|---------------------------|---------------|---------|
| .L49:                    |                           |               |         |
| addl (%eax,%edx,4),%ecx  | load (%eax, %edx.0, 4)    | $\rightarrow$ | t.1a    |
|                          | addl t.1a, %ecx.0c        | $\rightarrow$ | %ecx.1a |
| addl 4(%eax,%edx,4),%ecx | load 4(%eax, %edx.0, 4)   | $\rightarrow$ | t.1b    |
|                          | addl t.1b, %ecx.1a        | $\rightarrow$ | %ecx.1b |
| addl 8(%eax,%edx,4),%ecx | load 8(%eax, %edx.0, 4)   | $\rightarrow$ | t.1c    |
|                          | addl t.1c, %ecx.1b        | $\rightarrow$ | %ecx.1c |
| addl %edx,3              | addl %edx.0, 3            | $\rightarrow$ | %edx.1  |
| cmpl %esi,%edx           | cmpl %esi, %edx.1         | $\rightarrow$ | cc.1    |
| jl .L49                  | jl-taken cc.1             |               |         |

As mentioned earlier, loop unrolling by itself will only help the performance of the code for the case of integer sum, since our other cases are limited by the latency of the functional units. For integer sum, three-way unrolling allows us to combine three elements with six integer/branch operations, as shown in Figure 5.20. With two functional units for these operations, we could potentially achieve a CPE of 1.0. Figure 5.21



Figure 5.21: Scheduling of Operations for Three-Way Unrolled Integer Sum with Bounded Resource Constraints. In principle, the procedure can achieve a CPE of 1.0. The measured CPE, however, is 1.33.

shows that once we reach iteration 3 (i = 6), the operations would follow a regular pattern. The operations of iteration 4 (i = 9) have the same timings, but shifted by three cycles. This would indeed yield a CPE of 1.0.

Our measurement for this function shows a CPE of 1.33, that is, we require four cycles per iteration. Evidently some resource constraint we did not account for in our analysis delays the computation by one additional cycle per iteration. Nonetheless, this performance represents an improvement over the code without loop unrolling.

Measuring the performance for different degrees of unrolling yields the following values for the CPE

| Vector Length | Degree of Unrolling |      |      |      |      |      |
|---------------|---------------------|------|------|------|------|------|
|               | 1 2 3 4 8 10        |      |      |      |      |      |
| CPE           | 2.00                | 1.50 | 1.33 | 1.50 | 1.25 | 1.06 |

As these measurements show, loop unrolling can reduce the CPE. With the loop unrolled by a factor of two, each iteration of the main loop requires three clock cycles, giving a CPE of 3/2=1.5. As we increase the degree of unrolling, we generally get better performance, nearing the theoretical CPE limit of 1.0. It is interesting to note that the improvement is not monotonic—unrolling by three gives better performance than unrolling by four. Evidently the scheduling of operations on the execution units is less efficient for the latter case.

Our CPE measurements do not account for overhead factors such as the cost of the procedure call and of setting up the loop. With loop unrolling, we introduce a new source of overhead—the need to finish any remaining elements when the vector length is not divisible by the degree of unrolling. To investigate the impact of overhead, we measure the *net CPE* for different vector lengths. The net CPE is computed as the total number of cycles required by the procedure divided by the number of elements. For the different degrees of unrolling, and for two different vector lengths we obtain the following:

| Vector Length | Degree of Unrolling |      |      |      |      |      |  |  |
|---------------|---------------------|------|------|------|------|------|--|--|
|               | 1 2 3 4 8 1         |      |      |      |      |      |  |  |
| СРЕ           | 2.00                | 1.50 | 1.33 | 1.50 | 1.25 | 1.06 |  |  |
| 31 Net CPE    | 4.02                | 3.57 | 3.39 | 3.84 | 3.91 | 3.66 |  |  |
| 1024 Net CPE  | 2.06                | 1.56 | 1.40 | 1.56 | 1.31 | 1.12 |  |  |

The distinction between CPE and net CPE is minimal for long vectors, as seen with the measurements for length 1024, but the impact is significant for short vectors, as seen with the measurements for length 31. Our measurements of the net CPE for a vector of length 31 demonstrate one drawback of loop unrolling. Even with no unrolling, the net CPE of 4.02 is considerably higher than the 2.06 measured for long vectors. The overhead of starting and completing the loop becomes far more significant when the loop is executed a smaller number of times. In addition, the benefit of loop unrolling is less significant. Our unrolled code must start and stop two loops, and it must complete the final elements one at a time. The overhead decreases with increased loop unrolling, while the number of operations performed in the final loop increases. With a vector length of 1024, performance generally improves as the degree of unrolling increases. With a vector length of 31, the best performance is achieved by unrolling the loop by only a factor of three.

A second drawback of loop unrolling is that it increases the amount of object code generated. The object code for combine4 requires 63 bytes, whereas the object code with the loop unrolled by a factor of 16

requires 142 bytes. In this case, that seems like a small price to pay for code that runs nearly twice as fast. In other cases, however, the optimum position in this time-space tradeoff is not so clear.

## 5.9 Converting to Pointer Code

Before proceeding further, let us attempt one more transformation that can sometimes improve program performance, at the expense of program readability. One of the unique features of C is the ability to create and reference pointers to arbitrary program objects. Pointer arithmetic, in fact, has a close connection to array referencing. The combination of pointer arithmetic and referencing given by the expression \*(a+i) is exactly equivalent to the array reference a [i]. At times, we can improve the performance of a program by using pointers rather than arrays.

Figure 5.22 shows an example of converting the procedures combine4 and combine5 to pointer code, giving procedures combine4p and combine5p, respectively. Instead of keeping pointer data fixed at the beginning of the vector, we move it with each iteration. The vector elements are then referenced by a fixed offset (between 0 and 2) of data. Most significantly, we can eliminate the iteration variable i from the procedure. To detect when the loop should terminate, we compute a pointer dend to be an upper bound on pointer data.

Comparing the performance of these procedures to their array counterparts yields mixed results:

| Function    | Page | Method                  | Integer |      | hod Integer Floating |      | g Point |
|-------------|------|-------------------------|---------|------|----------------------|------|---------|
|             |      |                         | +       | *    | +                    | *    |         |
| combine4    | 219  | Accumulate in temporary | 2.00    | 4.00 | 3.00                 | 5.00 |         |
| combine4p   | 239  | Pointer version         | 3.00    | 4.00 | 3.00                 | 5.00 |         |
| combine5    | 234  | Unroll loop ×3          | 1.33    | 4.00 | 3.00                 | 5.00 |         |
| combine5p   | 239  | Pointer version         | 1.33    | 4.00 | 3.00                 | 5.00 |         |
| combine5x4  |      | Unroll loop ×4          | 1.50    | 4.00 | 3.00                 | 5.00 |         |
| combine5px4 |      | Pointer version         | 1.25    | 4.00 | 3.00                 | 5.00 |         |

For most of the cases, the array and pointer versions have the exact same performance. With pointer code, the CPE for integer sum with no unrolling actually gets worse by one cycle. This result is somewhat surprising, since the inner loops for the pointer and array versions are very similar, as shown in Figure 5.23. It is hard to imagine why the pointer code requires an additional clock cycle per iteration. Just as mysteriously, versions of the procedures with four-way loop unrolling yield a one-cycle-per-iteration improvement with pointer code, giving a CPE of 1.25 (five cycles per iteration) rather then 1.5 (six cycles per iteration).

In our experience, the relative performance of pointer versus array code depends on the machine, the compiler, and even the particular procedure. We have seen compilers that apply very advanced optimizations to array code but only minimal optimizations to pointer code. For the sake of readability, array code is generally preferable.

#### **Practice Problem 5.3:**

At times, GCC does its own version of converting array code to pointer code. For example, with integer data and addition as the combining operation, it generates the following code for the inner loop of a variant of combine5 that uses eight-way loop unrolling:

\_\_\_\_ code/opt/combine.c

```
_____code/opt/combine.c
1 /* Accumulate in local variable, pointer version */
2 void combine4p(vec ptr v, data t *dest)
3 {
       int length = vec length(v);
4
       data_t *data = get_vec_start(v);
5
       data t *dend = data+length;
       data t x = IDENT;
       for (; data < dend; data++)</pre>
9
        x = x OPER *data;
10
       *dest = x;
11
12 }
                                                          _____code/opt/combine.c
                           (a) Pointer version of combine4.
                                                         _____ code/opt/combine.c
_{\rm 1} /* Unroll loop by 3, pointer version */
2 void combine5p(vec ptr v, data t *dest)
3 {
       data t *data = get vec start(v);
4
       data t *dend = data+vec length(v);
       data t *dlimit = dend-2;
       data_t x = IDENT;
7
       /* Combine 3 elements at a time */
9
       for (; data < dlimit; data += 3) {</pre>
           x = x OPER data[0] OPER data[1] OPER data[2];
11
12
13
       /* Finish any remaining elements */
       for (; data < dend; data++) {</pre>
15
16
          x = x OPER data[0];
17
       *dest = x;
18
19 }
```

Figure 5.22: Converting Array Code to Pointer Code. In some cases, this can lead to improved performance.

(b) Pointer version of combine5

```
combine4: type=INT, OPER = '+'
     data in %eax, x in %ecx, i in %edx, length in %esi
1 .L24:
                                  loop:
    addl (%eax,%edx,4),%ecx Add data[i] to x
    incl %edx
    cmpl %esi, %edx
                                   Compare i:length
    il .L24
                                    If <, goto loop
                                (a) Array code
     combine4p: type=INT, OPER = '+'
     data in %eax, x in %ecx, dend in %edx
1 .L30:
    addl (%eax),%ecx
                                    Add data[0] to x
    addl $4,%eax
                                   data++
    cmpl %edx,%eax
                                    Compare data:dend
    jb .L30
                                    If <, goto loop
```

Figure 5.23: **Pointer Code Performance Anomaly.** Although the two programs are very similar in structure, the array code requires two cycles per iteration, while the pointer code requires three.

(b) Pointer code

```
1 .L6:
addl (%eax),%edx
addl 4(%eax),%edx
4 addl 8(%eax),%edx
  addl 12(%eax),%edx
5
addl 16(%eax),%edx
7
   addl 20(%eax),%edx
    addl 24(%eax),%edx
    addl 28(%eax),%edx
9
    addl $32,%eax
    addl $8,%ecx
11
    cmpl %esi, %ecx
12
    jl .L6
13
```

Observe how register %eax is being incremented by 32 on each iteration.

Write C code for a procedure combine5px8 that shows how pointers, loop variables, and termination conditions are being computed by this code. Show the general form with arbitrary data and combining operation in the style of Figure 5.19. Describe how it differs from our handwritten pointer code (Figure 5.22).

\_ code/opt/combine.c

```
1 /* Unroll loop by 2, 2-way parallelism */
2 void combine6(vec ptr v, data t *dest)
3 {
       int length = vec length(v);
4
       int limit = length-1;
       data t *data = get vec start(v);
6
       data t x0 = IDENT;
       data t x1 = IDENT;
8
       int i;
9
10
       /* Combine 2 elements at a time */
11
12
       for (i = 0; i < limit; i+=2) {
           x0 = x0 OPER data[i];
13
           x1 = x1 OPER data[i+1];
14
       }
15
17
       /* Finish any remaining elements */
       for (; i < length; i++) {
18
           x0 = x0 OPER data[i];
19
20
       *dest = x0 OPER x1;
21
22 }
                                                                   code/opt/combine.c
```

Figure 5.24: Unrolling Loop by 2 and Using Two-Way Parallelism. This approach makes use of the pipelining capability of the functional units.

# 5.10 Enhancing Parallelism

At this point, our programs are limited by the latency of the functional units. As the third column in Figure 5.12 shows, however, several functional units of the processor are *pipelined*, meaning that they can start on a new operation before the previous one is completed. Our code cannot take advantage of this capability, even with loop unrolling, since we are accumulating the value as a single variable x. We cannot compute a new value of x until the preceding computation has completed. As a result, the processor will *stall*, waiting to begin a new operation until the current one has completed. This limitation shows clearly in Figures 5.15 and 5.17. Even with unbounded processor resources, the multiplier can only produce a new result every four clock cycles. Similar limitations occur with floating-point addition (three cycles) and multiplication (five cycles).

#### 5.10.1 Loop Splitting

For a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the set of combining operations into two or more parts and combining



Figure 5.25: Operations for First Iteration of Inner Loop of Two-Way Unrolled, Two-Way Parallel Integer Multiplication. The two multiplication operations are logically independent.

the results at the end. For example, let  $P_n$  denote the product of elements  $a_0, a_1, \ldots, a_{n-1}$ :

$$P_n = \prod_{i=0}^{n-1} a_i$$

Assuming n is even, we can also write this as  $P_n = PE_n \times PO_n$ , where  $PE_n$  is the product of the elements with even indices, and  $PO_n$  is the product of the elements with odd indices:

$$PE_n = \prod_{i=0}^{n/2-2} a_{2i}$$

$$PO_n = \prod_{i=0}^{n/2-2} a_{2i+1}$$

Figure 5.24 shows code that uses this method. It uses both two-way loop unrolling to combine more elements per iteration, and two-way parallelism, accumulating elements with even index in variable x0, and elements with odd index in variable x1. As before, we include a second loop to accumulate any remaining array elements for the case where the vector length is not a multiple of 2. We then apply the combining operation to x0 and x1 to compute the final result.

To see how this code yields improved performance, let us consider the translation of the loop into operations for the case of integer multiplication:

| Assembly Instructions               | Execution Unit Operations |               |        |
|-------------------------------------|---------------------------|---------------|--------|
| .L151:                              |                           |               |        |
| <pre>imull (%eax,%edx,4),%ecx</pre> | load (%eax, %edx.0, 4)    | $\rightarrow$ | t.1a   |
|                                     | imull t.1a, %ecx.0        | $\rightarrow$ | %ecx.1 |
| imull 4(%eax,%edx,4),%ebx           | load 4(%eax, %edx.0, 4)   | $\rightarrow$ | t.1b   |
|                                     | imull t.1b, %ebx.0        | $\rightarrow$ | %ebx.1 |
| addl \$2,%edx                       | addl \$2, %edx.0          | $\rightarrow$ | %edx.1 |
| cmpl %esi,%edx                      | cmpl %esi, %edx.1         | $\rightarrow$ | cc.1   |
| jl .L151                            | jl-taken cc.1             |               |        |

Figure 5.25 shows a graphical representation of these operations for the first iteration (i = 0). As this diagram illustrates, the two multiplications in the loop are independent of each other. One has register ex as its source and destination (corresponding to program variable ex0), while the other has register ex0 as its source and destination (corresponding to program variable ex1). The second multiplication can start just one cycle after the first. This makes use of the pipelining capabilities of both the load unit and the integer multiplier.

Figure 5.26 shows a graphical representation of the first three iterations (i = 0, 2, and 4) for integer multiplication. For each iteration, the two multiplications must wait until the results from the previous iteration have been computed. Still, the machine can generate two results every four clock cycles, giving a theoretical CPE of 2.0. In this figure we do not take into account the limited set of integer functional units, but this does not prove to be a limitation for this particular procedure.

Comparing loop unrolling alone to loop unrolling with two-way parallelism, we obtain the following performance:

| Function | Page | Method                                     | Integer |      | Floating Poin |      |
|----------|------|--------------------------------------------|---------|------|---------------|------|
|          |      |                                            | +       | *    | +             | *    |
|          |      | Unroll ×2                                  | 1.50    | 4.00 | 3.00          | 5.00 |
| combine6 | 241  | Unroll $\times 2$ , Parallelism $\times 2$ | 1.50    | 2.00 | 2.00          | 2.50 |

For integer sum, parallelism does not help, as the latency of integer addition is only one clock cycle. For integer and floating-point product, however, we reduce the CPE by a factor of two. We are essentially doubling the use of the functional units. For floating-point sum, some other resource constraint is limiting our CPE to 2.0, rather than the theoretical value of 1.5.

We have seen earlier that two's complement arithmetic is commutative and associative, even when overflow occurs. Hence for an integer data type, the result computed by combine6 will be identical to that computed by combine5 under all possible conditions. Thus, an optimizing compiler could potentially convert the code shown in combine4 first to a two-way unrolled variant of combine5 by loop unrolling, and then to that of combine6 by introducing parallelism. This is referred to as *iteration splitting* in the optimizing compiler literature. Many compilers do loop unrolling automatically, but relatively few do iteration splitting.

On the other hand, we have seen that floating-point multiplication and addition are not associative. Thus, combine5 and combine6 could potentially produce different results due to rounding or overflow. Imagine, for example, a case where all the elements with even indices were numbers with very large absolute value, while those with odd indices were very close to 0.0. Then product  $PE_n$  might overflow, or  $PO_n$  might underflow, even though the final product  $P_n$  does not. In most real-life applications, however, such



Figure 5.26: Scheduling of Operations for Two-Way Unrolled, Two-Way Parallel Integer Multiplication with Unlimited Resources. The multiplier can now generate two values every 4 cycles.

patterns are unlikely. Since most physical phenomena are continous, numerical data tend to be reasonably smooth and well-behaved. Even when there are discontinuities, they do not generally cause periodic patterns that lead to a condition such as that sketched above. It is unlikely that summing the elements in strict order gives fundamentally better accuracy than does summing two groups independently and then adding those sums together. For most applications, achieving a performance gain of 2X outweighs the risk of generating different results for strange data patterns. Nevertheless, a program developer should check with potential users to see if there are particular conditions that may cause the revised algorithm to be unacceptable.

Just as we can unroll loops by an arbitrary factor k, we can also increase the parallelism to any factor p such that k is divisible by p. The following are some results for different degrees of unrolling and parallelism:

| Method                                     | Inte | eger | Floating Point |      |
|--------------------------------------------|------|------|----------------|------|
|                                            | +    | *    | +              | *    |
| Unroll ×2                                  | 1.50 | 4.00 | 3.00           | 5.00 |
| Unroll $\times 2$ , Parallelism $\times 2$ | 1.50 | 2.00 | 2.00           | 2.50 |
| Unroll ×4                                  | 1.50 | 4.00 | 3.00           | 5.00 |
| Unroll $\times 4$ , Parallelism $\times 2$ | 1.50 | 2.00 | 1.50           | 2.50 |
| Unroll ×8                                  | 1.25 | 4.00 | 3.00           | 5.00 |
| Unroll $\times 8$ , Parallelism $\times 2$ | 1.25 | 2.00 | 1.50           | 2.50 |
| Unroll $\times 8$ , Parallelism $\times 4$ | 1.25 | 1.25 | 1.61           | 2.00 |
| Unroll $\times 8$ , Parallelism $\times 8$ | 1.75 | 1.87 | 1.87           | 2.07 |
| Unroll $\times 9$ , Parallelism $\times 3$ | 1.22 | 1.33 | 1.66           | 2.00 |

As this table shows, increasing the degree of loop unrolling and the degree of parallelism helps program performance up to some point, but it yields diminishing improvement or even worse performance when taken to an extreme. In the next section, we will describe two reasons for this phenomenon.

#### 5.10.2 Register Spilling

The benefits of loop parallelism are limited by the ability to express the computation in assembly code. In particular, the IA32 instruction set only has a small number of registers to hold the values being accumulated. If we have a degree of parallelism p that exceeds the number of available registers, then the compiler will resort to *spilling*, storing some of the temporary values on the stack. Once this happens, the performance drops dramatically. This occurs for our benchmarks when we attempt to have p = 8. Our measurements show the performance for this case is worse than that for p = 4.

For the case of the integer data type, there are only eight total integer registers available. Two of these (%ebp and %esp) point to regions of the stack. With the pointer version of the code, one of the remaining six holds the pointer data, and one holds the stopping position dend. This leaves only four integer registers for accumulating values. With the array version of the code, we require three registers to hold the loop index i, the stopping index limit, and the array address data. This leaves only three registers for accumulating values. For the floating-point data type, we need two of eight registers to hold intermediate values, leaving six for accumulating values. Thus, we could have a maximum parallelism of six before register spilling occurs.

This limitation to eight integer and eight floating-point registers is an unfortunate artifact of the IA32 instruc-

tion set. The renaming scheme described previously eliminates the direct correspondence between register names and the actual location of the register data. In a modern processor, register names serve simply to identify the program values being passed between the functional units. IA32 provides only a small number of such identifiers, constraining the amount of parallelism that can be expressed in programs.

The occurrence of spilling can be seen by examining the assembly code. For example, within the first loop for the code with eight-way parallelism we see the following instruction sequence:

In this code, a stack location is being used to hold x6, one of the eight local variables used to accumulate sums. The code loads it into a register, multiplies it by one of the data elements, and stores it back to the same stack location. As a general rule, any time a compiled program shows evidence of register spilling within some heavily used inner loop, it might be preferable to rewrite the code so that fewer temporary values are required. These include explicitly declared local variables as well as intermediate results being saved to avoid recomputation.

#### Practice Problem 5.4:

The following shows the code generated from a variant of combine6 that uses eight-way loop unrolling and four-way parallelism.

```
1 .L152:
addl (%eax),%ecx
3 addl 4(%eax),%esi
4 addl 8(%eax),%edi
  addl 12(%eax),%ebx
addl 16(%eax),%ecx
    addl 20(%eax),%esi
7
    addl 24(%eax),%edi
    addl 28(%eax),%ebx
9
    addl $32, %eax
    addl $8,%edx
11
    cmpl -8(%ebp), %edx
12
    jl .L152
```

- A. What program variable has being spilled onto the stack?
- B. At what location on the stack?
- C. Why is this a good choice of which value to spill?

With floating-point data, we want to keep all of the local variables in the floating-point register stack. We also need to keep the top of stack available for loading data from memory. This limits us to a degree of parallelism less than or equal to 7.

| Function   | Page | Method                                     | Integer |       | Floatir | ng Point |
|------------|------|--------------------------------------------|---------|-------|---------|----------|
|            |      |                                            | +       | *     | +       | *        |
| combine1   | 211  | Abstract unoptimized                       | 42.06   | 41.86 | 41.44   | 160.00   |
| combine1   | 211  | Abstract -02                               | 31.25   | 33.25 | 31.25   | 143.00   |
| combine2   | 212  | Move vec_length                            | 20.66   | 21.25 | 21.15   | 135.00   |
| combine3   | 217  | Direct data access                         | 6.00    | 9.00  | 8.00    | 117.00   |
| combine4   | 219  | Accumulate in temporary                    | 2.00    | 4.00  | 3.00    | 5.00     |
| combine5   | 234  | Unroll ×4                                  | 1.50    | 4.00  | 3.00    | 5.00     |
|            |      | Unroll ×16                                 | 1.06    | 4.00  | 3.00    | 5.00     |
| combine6   | 241  | Unroll $\times 2$ , Parallelism $\times 2$ | 1.50    | 2.00  | 2.00    | 2.50     |
|            |      | Unroll $\times 4$ , Parallelism $\times 2$ | 1.50    | 2.00  | 1.50    | 2.50     |
|            |      | Unroll $\times 8$ , Parallelism $\times 4$ | 1.25    | 1.25  | 1.50    | 2.00     |
| Worst:Best |      |                                            | 39.7    | 33.5  | 27.6    | 80.0     |

Figure 5.27: **Comparative Result for All Combining Routines.** The best performing version is shown in bold face.

#### 5.10.3 Limits to Parallelism

For our benchmarks, the main performance limitations are due to the capabilities of the functional units. As Figure 5.12 shows, the integer multiplier and the floating-point adder can only initiate a new operation every clock cycle. This, plus a similar limitation on the load unit limits these cases to a CPE of 1.0. The floating-point multiplier can only initiate a new operation every two clock cycles. This limits this case to a CPE of 2.0. Integer sum is limited to a CPE of 1.0, due to the limitations of the load unit. This leads to the following comparison between the achieved performance versus the theoretical limits:

| Method            | Integer |      | Floating Point |      |
|-------------------|---------|------|----------------|------|
|                   | +       | *    | +              | *    |
| Achieved          | 1.06    | 1.25 | 1.50           | 2.00 |
| Theoretical Limit | 1.00    | 1.00 | 1.00           | 2.00 |

In this table, we have chosen the combination of unrolling and parallelism that achieves the best performance for each case. We have been able to get close to the theoretical limit for integer sum and product and for floating-point product. Some machine-dependent factor limits the achieved CPE for floating-point multiplication to 1.50 rather than the theoretical limit of 1.0.

# 5.11 Putting it Together: Summary of Results for Optimizing Combining Code

We have now considered six versions of the combining code, some of which had multiple variants. Let us pause to take a look at the overall effect of this effort, and how our code would do on a different machine.

Figure 5.27 shows the measured performance for all of our routines plus several other variants. As can

be seen, we achieve maximum performance for the integer sum by simply unrolling the loop many times, whereas we achieve maximum performance for the other operations by introducing some, but not too much, parallelism. The overall performance gain of 27.6X and better from our original code is quite impressive.

## **5.11.1** Floating-Point Performance Anomaly

One of the most striking features of Figure 5.27 is the dramatic drop in the cycle time for floating-point multiplication when we go from combine3, where the product is accumulated in memory, to combine4 where the product is accumulated in a floating-point register. By making this small change, the code suddenly runs 23.4 times faster. When an unexpected result such as this one arises, it is important to hypothesize what could cause this behavior and then devise a series of tests to evaluate this hypothesis.

Examining the table, it appears that something strange is happening for the case of floating-point multiplication when we accumulate the results in memory. The performance is far worse than for floating-point addition or integer multiplication, even though the number of cycles for the functional units are comparable. On an IA32 processor, all floating-point operations are performed in extended 80-bit) precision, and the floating-point registers store values in this format. Only when the value in a register is written to memory is it converted to 32-bit (float) or 64-bit (double) format.

Examining the data used for our measurements, the source of the problem becomes clear. The measurements were performed on a vector of length 1024 having element i equal to i+1. Hence, we are attempting to compute 1024!, which is approximately  $5.4 \times 10^{2639}$ . Such a large number can be represented in the extended-precision floating-point format (it can represent numbers up to around  $10^{4932}$ ), but it far exceeds what can be represented as a single precision (up to around  $10^{38}$ ) or double precision (up to around  $10^{308}$ ). The single precision case overflows when we reach i=34, while the double precision case overflows when we reach i=171. Once we reach this point, every execution of the statement

```
*dest = *dest OPER val;
```

in the inner loop of combine3 requires reading the value  $+\infty$ , from dest, multiplying this by val to get  $+\infty$  and then storing this back at dest. Evidently, some part of this computation requires much longer than the normal five clock cycles required by floating-point multiplication. In fact, running measurements on this operation we find it takes between 110 and 120 cycles to multiply a number by infinity. Most likely, the hardware detected this as a special case and issued a *trap* that caused a software routine to perform the actual computation. The CPU designers felt such an occurrence would be sufficiently rare that they did not need to deal with it as part of the hardware design. Similar behavior could happen with underflow.

When we run the benchmarks on data for which every vector element equals 1.0, combine3 achieves a CPE of 10.00 cycles for both double and single precision. This is much more in line with the times measured for the other data types and operations, and comparable to the time for combine4.

This example illustrates one of the challenges of evaluating program performance. Measurements can be strongly affected by characteristics of the data and operating conditions that initially seem insignificant.

| Function   | Page | Method                                     | Integer |       | Floating Point |       |
|------------|------|--------------------------------------------|---------|-------|----------------|-------|
|            |      |                                            | +       | *     | +              | *     |
| combine1   | 211  | Abstract unoptimized                       | 40.14   | 47.14 | 52.07          | 53.71 |
| combine1   | 211  | Abstract -02                               | 25.08   | 36.05 | 37.37          | 32.02 |
| combine2   | 212  | Move vec_length                            | 19.19   | 32.18 | 28.73          | 32.73 |
| combine3   | 217  | Direct data access                         | 6.26    | 12.52 | 13.26          | 13.01 |
| combine4   | 219  | Accumulate in temporary                    | 1.76    | 9.01  | 8.01           | 8.01  |
| combine5   | 234  | Unroll ×4                                  | 1.51    | 9.01  | 6.32           | 6.32  |
|            |      | Unroll ×16                                 | 1.25    | 9.01  | 6.33           | 6.22  |
| combine6   | 241  | Unroll $\times 4$ , Parallelism $\times 2$ | 1.19    | 4.69  | 4.44           | 4.45  |
|            |      | Unroll $\times 8$ , Parallelism $\times 4$ | 1.15    | 4.12  | 2.34           | 2.01  |
|            |      | Unroll $\times 8$ , Parallelism $\times 8$ | 1.11    | 4.24  | 2.36           | 2.08  |
| Worst:Best |      |                                            | 36.2    | 11.4  | 22.3           | 26.7  |

Figure 5.28: Comparative Result for All Combining Routines Running on a Compaq Alpha 21164 **Processor.** The same general optimization techniques are useful on this machine as well.

## 5.11.2 Changing Platforms

Although we presented our optimization strategies in the context of a specific machine and compiler, the general principles also apply to other machine and compiler combinations. Of course, the optimal strategy may be very machine dependent. As an example, Figure 5.28 shows performance results for a Compaq Alpha 21164 processor for conditions comparable to those for a Pentium III shown in Figure 5.27. These measurements were taken for code generated by the Compaq C compiler, which applies more advanced optimizations than GCC. Observe how the cycle times generally decline as we move down the table, just as they did for the other machine. We see that we can effectively exploit a higher (eight-way) degree of parallelism, because the Alpha has 32 integer and 32 floating-point registers. As this example illustrates, the general principles of program optimization apply to a variety of different machines, even if the particular combination of features leading to optimum performance depend on the specific machine.

## 5.12 Branch Prediction and Misprediction Penalties

As we have mentioned, modern processors work well ahead of the currently executing instructions, reading new instructions from memory, and decoding them to determine what operations to perform on what operands. This *instruction pipelining* works well as long as the instructions follow in a simple sequence. When a branch is encountered, however, the processor must guess which way the branch will go. For the case of a conditional jump, this means predicting whether or not the branch will be taken. For an instruction such as an indirect jump (as we saw in the code to jump to an address specified by a jump table entry) or a procedure return, this means predicting the target address. In this discussion, we focus on conditional branches.

In a processor that employs *speculative execution*, the processor begins executing the instructions at the predicted branch target. It does this in a way that avoids modifying any actual register or memory locations

```
1 absval:
                                              2
                                                   pushl %ebp
                                                   movl %esp, %ebp
                                                   movl 8(%ebp), %eax
                                                                          Get val
                         _ code/opt/absval.c
                                                   testl %eax,%eax
                                                                          Test it
                                              6
                                                   jge .L3
                                                                          If >0, gotoend
1 int absval(int val)
                                              7
                                                   negl %eax
                                                                          Else, negate it
2 {
                                              8 .L3:
                                                                          end:
      return (val<0) ? -val : val;
3
                                              9
                                                   movl %ebp, %esp
4 }
                                                   popl %ebp
                                             10
                                             11
                                                   ret
                         __ code/opt/absval.c
              (a) C code.
                                                        (b) Assembly code.
```

Figure 5.29: **Absolute Value Code** We use this to measure the cost of branch misprediction.

until the actual outcome has been determined. If the prediction is correct, the processor simply "commits" the results of the speculatively executed instructions by storing them in registers or memory. If the prediction is incorrect, the processor must discard all of the speculatively executed results, and restart the instruction fetch process at the correct location. A significant *branch penalty* is incurred in doing this, because the instruction pipeline must be refilled before useful results are generated.

Once upon a time, the technology required to support speculative execution was considered too costly and exotic for all but the most advanced supercomputers. Since around 1998, integrated circuit technology has made it possible to put so much circuitry on one chip that some can be dedicated to supporting branch prediction and speculative execution. At this point, almost every processor in a desktop or server machine supports speculative execution.

In optimizing our combining procedure, we did not observe any performance limitation imposed by the loop structure. That is, it appeared that the only limiting factor to performance was due to the functional units. For this procedure, the processor was generally able to predict the direction of the branch at the end of the loop. In fact, if it predicted the branch will always be taken, the processor would be correct on all but the final iteration.

Many schemes have been devised for predicting branches, and many studies have been made on their performance. A common heuristic is to predict that any branch to a lower address will be taken, while any branch to a higher address will not be taken. Branches to lower addresses are used to close loops, and since loops are usually executed many times, predicting these branches as being taken is generally a good idea. Forward branches, on the other hand, are used for conditional computation. Experiments have shown that the backward-taken, forward-not-taken heuristic is correct around 65% of the time. Predicting all branches as being taken, on the other other hand, has a success rate of only around 60%. Far more sophisticated strategies have been devised, requiring greater amounts of hardware. For example, the Intel Pentium II and III processors use a branch prediction strategy that is claimed to be correct between 90% and 95% of the time [29].

We can run experiments to test the branch predication capability of a processor and the cost of a misprediction. We use the absolute value routine shown in Figure 5.29 as our test case. This figure also shows the compiled form. For nonnegative arguments, the branch will be taken to skip over the negation instruction.

We time this function computing the absolute value of every element in an array, with the array consisting of various patterns of of +1s and -1s. For regular patterns (e.g., all +1s, all -1s, or alternating +1 and -1s), we find the function requires between 13.01 and 13.41 cycles. We use this as our estimate of the performance with perfect branch condition. On an array set to random patterns of +1s and -1s, we find that the function requires 20.32 cycles. One principle of random processes is that no matter what strategy one uses to guess a sequence of values, if the underlying process is truly random, then we will be right only 50% of the time. For example, no matter what strategy one uses to guess the outcome of a coin toss, as long as the coin toss is fair, our probability of success is only 0.5. Thus, we can see that a mispredicted branch with this processor incurs a penalty of around 14 clock cycles, since a misprediction rate of 50% causes the function to run an average of 7 cycles slower. This means that calls to absval require between 13 and 27 cycles depending on the success of the branch predictor.

This penalty of 14 cycles is quite large. For example, if our prediction accuracy were only 65%, then the processor would waste, on average,  $14 \times 0.35 = 4.9$  cycles for every branch instruction. Even with the 90 to 95% prediction accuracy claimed for the Pentium II and III, around one cycle is wasted for every branch due to mispredictions. Studies of actual programs show that branches constitute around 14 to 16% of all executed instructions in typical "integer" programs (i.e., those that do not process numeric data), and around 3 to 12% of all executed instructions in typical numeric programs[31, Sect. 3.5]. Thus, any wasted time due to inefficient branch handling can have a significant effect on processor performance.

Many data dependent branches are not at all predictable. For example, there is no basis for guessing whether an argument to our absolute value routine will be positive or negative. To improve performance on code involving conditional evaluation, many processor designs have been extended to include *conditional move* instructions. These instructions allow some forms of conditionals to be implemented without any branch instructions.

With the IA32 instruction set, a number of different cmov instructions were added starting with the PentiumPro. These are supported by all recent Intel and Intel-compatible processors. These instructions perform an operation similar to the C code:

```
if (COND)
 x = y;
```

where y is the source operand and x is the destination operand. The condition *COND* determining whether the copy operation takes place is based on some combination of condition code values, similar to the test and conditional jump instructions. As an example, the cmovll instruction performs a copy when the condition codes indicate a value less than zero. Note that the first 'l' of this instruction indicates "less," while the second is the GAS suffix for long word.

The following assembly code shows how to implement absolute value with conditional move.

```
movl 8(%ebp),%eax
movl %eax,%edx
negl %edx
testl %eax,%eax
Conditionally move %edx to %eax

fer val as result
Copy to %edx
Negate %edx
Test val
fonditionally move %edx to %eax

ff < 0, copy %edx to result</pre>
```

As this code shows, the strategy is to set val as a return value, compute -val, and conditionally move it to register %eax to change the return value when val is negative. Our measurements of this code shows that it runs for 13.7 cycles regardless of the data patterns. This clearly yields better overall performance than a procedure that requires between 13 and 27 cycles.

#### **Practice Problem 5.5:**

A friend of yours has written an optimizing compiler that makes use of conditional move instructions. You try compiling the following C code:

```
1 /* Dereference pointer or return 0 if null */
2 int deref(int *xp)
3 {
4     return xp ? *xp : 0;
5 }
```

The compiler generates the following code for the body of the procedure.

```
1 movl 8(%ebp),%edx
2 movl (%edx),%eax
3 testl %edx,%edx
4 cmovll %edx,%eax
Get *xp as result
Test xp
If 0, copy 0 to result
```

Explain why this code does not provide a valid implementation of deref

The current version of GCC does not generate any code using conditional moves. Due to a desire to remain compatible with earlier 486 and Pentium processors, the compiler does not take advantage of these new features. In our experiments, we used the handwritten assembly code shown above. A version using GCC's facility to embed assembly code within a C program (Section 3.15) required 17.1 cycles due to poorer quality code generation.

Unfortunately, there is not much a C programmer can do to improve the branch performance of a program, except to recognize that data-dependent branches incur a high cost in terms of performance. Beyond this, the programmer has little control over the detailed branch structure generated by the compiler, and it is hard to make branches more predictable. Ultimately, we must rely on a combination of good code generation by the compiler to minimize the use of conditional branches, and effective branch prediction by the processor to reduce the number of branch mispredictions.

# 5.13 Understanding Memory Performance

All of the code we have written, and all the tests we have run, require relatively small amounts of memory. For example, the combining routines were measured over vectors of length 1024, requiring no more than 8,096 bytes of data. All modern processors contain one or more *cache* memories to provide fast access to such small amounts of memory. All of the timings in Figure 5.12 assume that the data being read or written

code/ont/list

```
1 typedef struct ELE {
       struct ELE *next;
       int data;
4 } list ele, *list ptr;
6 static int list len(list ptr ls)
7 {
       int len = 0;
8
9
       for (; ls; ls = ls->next)
10
11
           len++;
12
       return len;
13 }
                                                                         _ code/opt/list.c
```

Figure 5.30: Linked List Functions. These illustrate the latency of the load operation.

is contained in cache. In Chapter 6, we go into much more detail about how caches work and how to write code that makes best use of the cache.

In this section, we will further investigate the performance of load and store operations while maintaining the assumption that the data being read or written are held in cache. As Figure 5.12 shows, both of these units have a latency of 3, and an issue time of 1. All of our programs so far have used only load operations, and they have had the property that the address of one load depended on incrementing some register, rather than as the result of another load. Thus, as shown in Figures 5.15 to 5.18, 5.21 and 5.26, the load operations could take advantage of pipelining to initiate new load operations on every cycle. The relatively long latency of the load operation has not had any adverse affect on program performance.

#### 5.13.1 Load Latency

As an example of code whose performance is constrained by the latency of the load operation, consider the function <code>list\_len</code>, shown in Figure 5.30. This function computes the length of a linked list. In the loop of this function, each successive value of variable <code>ls</code> depends on the value read by the pointer reference <code>ls->next</code>. Our measurements show that function <code>list\_len</code> has a CPE of 3.00, which we claim is a direct reflection of the latency of the load operation. To see this, consider the assembly code for the loop, and the translation of its first iteration into operations:

| Assembly Instructions | Execution Unit Operations |               |        |
|-----------------------|---------------------------|---------------|--------|
| .L27:                 |                           |               |        |
| incl %eax             | incl %eax.0               | $\rightarrow$ | %eax.1 |
| movl (%edx),%edx      | load (%edx.0)             | $\rightarrow$ | %edx.1 |
| testl %edx,%edx       | testl %edx.1,%edx.1       | $\rightarrow$ | cc.1   |
| jne .L27              | jne-taken cc.1            |               |        |



Figure 5.31: **Scheduling of Operations for List Length Function.** The latency of the load operation limits the CPE to a minimum of 3.0.

\_\_ code/opt/copy.c

code/opt/copy.c

```
1 /* Set element of array to 0 */
2 static void array clear(int *src, int *dest, int n)
       int i;
4
       for (i = 0; i < n; i++)
7
           dest[i] = 0;
8 }
10 /* Set elements of array to 0, unrolling by 8 */
11 static void array clear 8(int *src, int *dest, int n)
       int i;
13
       int len = n - 7;
14
15
       for (i = 0; i < len; i+=8) {
17
           dest[i] = 0;
           dest[i+1] = 0;
18
           dest[i+2] = 0;
19
           dest[i+3] = 0;
20
           dest[i+4] = 0;
21
22
           dest[i+5] = 0;
           dest[i+6] = 0;
23
           dest[i+7] = 0;
24
       }
25
       for (; i < n; i++)
26
           dest[i] = 0;
27
28 }
```

Figure 5.32: Functions to Clear Array. These illustrate the pipelining of the store operation.

Each successive value of register %edx depends on the result of a load operation having %edx as an operand. Figure 5.31 shows the scheduling of operations for the first three iterations of this function. As can be seen, the latency of the load operation limits the CPE to 3.0.

## 5.13.2 Store Latency

In all of our examples so far, we have interacted with the memory only by using the load operation to read from a memory location into a register. Its counterpart, the *store* operation, writes a register value to memory. As Figure 5.12 indicates, this operation also has a nominal latency of three cycles, and an issue time of one cycle. However, its behavior, and its interactions with load operations, involve several subtle issues.

As with the load operation, in most cases the store operation can operate in a fully pipelined mode, beginning

```
_ code/opt/copy.c
 1 /* Write to dest, read from src */
 2 static void write read(int *src, int *dest, int n)
 3 {
        int cnt = n;
 4
        int val = 0;
 6
        while (cnt--) {
             *dest = val;
 8
             val = (*src) + 1;
 9
10
11 }
                                                                               code/opt/copy.c
Example A: write read(&a[0],&a[1],3)
          Initial
                       Iter. 1
                                   Iter. 2
                                               Iter. 3
                           2
   cnt
               3
                                       1
                                                    0
         -10
              17
                     -10
                           0
                                 -10
                                       -9
                                              -10
                                                   -9
     а
   val
               0
                           -9
                                       -9
                                                   -9
```



**Example B**: write read(&a[0],&a[0],3)

Figure 5.33: Code to Write and Read Memory Locations, Along with Illustrative Executions. This function highlights the interactions between stores and loads when arguments src and dest are equal.

a new store on every cycle. For example, consider the functions shown in Figure 5.32 that set the elements of an array dest of length n to zero. Our measurements for the first version show a CPE of 2.00. Since each iteration requires a store operation, it is clear that the processor can begin a new store operation at least once every two cycles. To probe further, we try unrolling the loop eight times, as shown in the code for array\_clear\_8. For this one we measure a CPE of 1.25. That is, each iteration requires around ten cycles and issues eight store operations. Thus, we have nearly achieved the optimum limit of one new store operation per cycle.

Unlike the other operations we have considered so far, the store operation does not affect any register values. Thus, by their very nature a series of store operations must be independent from each other. In fact, only a load operation is affected by the result of a store operation, since only a load can read back the memory location that has been written by the store. The function write\_read shown in Figure 5.33 illustrates the potential interactions between loads and stores. This figure also shows two example executions of this



Figure 5.34: **Detail of Load and Store Units.** The store unit maintains a buffer of pending writes. The load unit must check its address with those in the store unit to detect a write/read dependency.

function, when it is called for a two-element array a, with initial contents -10 and 17, and with argument cnt equal to 3. These executions illustrate some subtleties of the load and store operations.

In example A of Figure 5.33, argument src is a pointer to array element a [0], while dest is a pointer to array element a [1]. In this case, each load by the pointer reference \*src will yield the value -10. Hence, after two iterations, the array elements will remain fixed at -10 and -9, respectively. The result of the read from src is not affected by the write to dest. Measuring this example, but over a larger number of iterations, gives a CPE of 2.00.

In example B of Figure 5.33(a), both arguments src and dest are pointers to array element a [0]. In this case, each load by the pointer reference \*src will yield the value stored by the previous execution of the pointer reference \*dest. As a consequence, a series of ascending values will be stored in this location. In general, if function  $write\_read$  is called with arguments src and dest pointing to the same memory location, and with argument cnt having some value n>0, the net effect is to set the location to n-1. This example illustrates a phenomenon we will call write/read dependency—the outcome of a memory read depends on a very recent memory write. Our performance measurements show that example B has a CPE of 6.00. The write/read dependency causes a slowdown in the processing.

To see how the processor can distinguish between these two cases and why one runs slower than another, we must take a more detailed look at the load and store execution units, as shown in Figure 5.34. The store unit contains a **store buffer** containing the addresses and data of the store operations that have been issued to the store unit, but have not yet been completed, where completion involves updating the data cache. This buffer is provided so that a series of store operations can be executed without having to wait for each one to update the cache. When a load operation occurs, it must check the entries in the store buffer for matching addresses. If it finds a match, it retrieves the corresponding data entry as the result of the load operation.

The assembly code for the inner loop, and its translation into operations during the first iteration, is as follows:



Figure 5.35: **Timing of write\_read for Example A.** The store and load operations have different addresses, and so the load can proceed without waiting for the store.

| Assembly Instructions | Execution Unit Operations     |         |
|-----------------------|-------------------------------|---------|
| .L32:                 |                               |         |
| movl %edx,(%ecx)      | storeaddr (%ecx)              |         |
|                       | storedata %edx.0              |         |
| movl (%ebx),%edx      | load (%ebx) $ ightarrow$      | %edx.1a |
| incl %edx             | incl %edx.1a $ ightarrow$     | %edx.1b |
| decl %eax             | decl $\$$ eax.0 $\rightarrow$ | %eax.1  |
| jnc .L32              | jnc-taken cc.1                |         |

Observe that the instruction movl %edx, (%ecx) is translated into two operations: the storeaddr instruction computes the address for the store operation, creates an entry in the store buffer, and sets the address field for that entry. The storedata instruction sets the data field for the entry. Since there is only one store unit, and store operations are processed in program order, there is no ambiguity about how the two operations match up. As we will see, the fact that these two computations are performed independently can be important to program performance.

Figure 5.35 shows the timing of the operations for the first two iterations of write\_read for the case of example A. As indicated by the dotted line between the storeaddr and load operations, the storeaddr operation creates an entry in the store buffer, which is then checked by the load. Since these are unequal, the load proceeds to read the data from the cache. Even though the store operation has not been completed, the processor can detect that it will affect a different memory location than the load is trying to read. This process is repeated on the second iteration as well. Here we can see that the storedata operation must wait until the result from the previous iteration has been loaded and incremented. Long before this, the storeaddr operation and the load operations can match up their adddresses, determine they are different, and allow the load to proceed. In our computation graph, we show the load for the second iteration beginning just one cycle after the load from the first. If continued for more iterations, we would find the



Figure 5.36: **Timing of write\_read for Example B.** The store and load operations have the same address, and hence the load must wait until it can get the result from the store.

graph indicates a CPE of 1.0. Evidentally, some other resource constraint limits the actual performance to a CPE of 2.0.

Figure 5.36 shows the timing of the operations for the first two iterations of write\_read for the case of example B. Again, the dotted line between the storeaddr and load operations indicates that the the storeaddr operation creates an entry in the store buffer which is then checked by the load. Since these are equal, the load must wait until the storedata operation has completed, and then it gets the data from the store buffer. This waiting is indicated in the graph by a much more elongated box for the load operation. In addition, we show a dashed arrow from the storedata to the load operations to indicate that the result of the storedata is passed to the load as its result. Our timings of these operations are drawn to reflect the measured CPE of 6.0. Exactly how this timing arises is not totally clear, however, and so these figures are intended to be more illustrative than factual. In general, the processor/memory interface is one of the most complex portions of a processor design. Without access to detailed documentation and machine analysis tools, we can only give a hypothetical description of the actual behavior.

As these two examples show, the implementation of memory operations involves many subtleties. With operations on registers, the processor can determine which instructions will affect which others as they are being decoded into operations. With memory operations, on the other hand, the processor cannot predict which will affect which others until the load and store addresses have been computed. Since memory

operations make up a significant fraction of the program, the memory subsystem is optimized to run with greater parallelism for independent memory operations.

#### **Practice Problem 5.6:**

As another example of code with potential load-store interactions, consider the following function to copy the contents of one array to another:

```
1 static void copy_array(int *src, int *dest, int n)
2 {
3     int i;
4     
5     for (i = 0; i < n; i++)
6         dest[i] = src[i];
7 }</pre>
```

Suppose a is an array of length 1000 initialized so that each element a [i] equals i.

- A. What would be the effect of the call copy array (a+1, a, 999)?
- B. What would be the effect of the call copy array(a, a+1, 999)?
- C. Our performance measurements indicate that the call of part A has a CPE of 3.00, while the call of part B has a CPE of 5.00. To what factor do you attribute this performance difference?
- D. What performance would you expect for the call copy array (a, a, 999)?

# 5.14 Life in the Real World: Performance Improvement Techniques

Although we have only considered a limited set of applications, we can draw important lessons on how to write efficient code. We have described a number of basic strategies for optimizing program performance:

- 1. High-level design. Choose appropriate algorithms and data structures for the problem at hand. Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance.
- 2. Basic coding principles. Avoid optimization blockers so that a compiler can generate efficient code.
  - (a) Eliminate excessive function calls. Move computations out of loops when possible. Consider selective compromises of program modularity to gain greater efficiency.
  - (b) Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results. Store a result in an array or global variable only when the final value has been computed.
- 3. Low-level optimizations.
  - (a) Try various forms of pointer versus array code.
  - (b) Reduce loop overhead by unrolling loops.
  - (c) Find ways to make use of the pipelined functional units by techniques such as iteration splitting.